speech separation
Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network
Recent advances in the design of neural network architectures, in particular those specialized in modeling sequences, have provided significant improvements in speech separation performance. In this work, we propose to use a bio-inspired architecture called Fully Recurrent Convolutional Neural Network (FRCNN) to solve the separation task. This model contains bottom-up, top-down and lateral connections to fuse information processed at various time-scales represented by stages. In contrast to the traditional approach of updating all stages in parallel, we propose to first update the stages one by one in the bottom-up direction, then fuse information from adjacent stages simultaneously, and finally fuse information from all stages into the bottom stage. Experiments showed that this asynchronous updating scheme achieved significantly better results with far fewer parameters than the traditional synchronous updating scheme. In addition, the proposed model achieved a good balance between speech separation accuracy and computational efficiency compared with other state-of-the-art models on three benchmark datasets.
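The three-phase update order described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the stages are plain lists at halving time resolutions, the bottom stage is taken to be the finest resolution, and simple averaging stands in for the learned convolutional fusion of the actual model.

```python
def downsample(x):
    """Halve the time resolution by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x, length):
    """Repeat each sample so the result has `length` samples (nearest-neighbour)."""
    factor = length // len(x)
    return [v for v in x for _ in range(factor)]


def mean_lists(lists):
    """Element-wise mean of equal-length lists (stand-in for learned fusion)."""
    return [sum(vals) / len(vals) for vals in zip(*lists)]


def async_update(x, num_stages=3):
    """One toy pass of the asynchronous schedule: bottom-up, adjacent fusion, global fusion."""
    if len(x) % 2 ** (num_stages - 1) != 0:
        raise ValueError("input length must be divisible by 2 ** (num_stages - 1)")

    # Phase 1: update the stages one by one in the bottom-up direction.
    stages = [list(x)]
    for _ in range(num_stages - 1):
        stages.append(downsample(stages[-1]))

    # Phase 2: fuse each stage with its adjacent stages simultaneously,
    # i.e. every fused stage is computed from the phase-1 values only.
    fused = []
    for i, stage in enumerate(stages):
        inputs = [stage]
        if i > 0:
            inputs.append(downsample(stages[i - 1]))
        if i < num_stages - 1:
            inputs.append(upsample(stages[i + 1], len(stage)))
        fused.append(mean_lists(inputs))

    # Phase 3: fuse information from all stages into the bottom stage.
    return mean_lists([upsample(stage, len(x)) for stage in fused])


output = async_update([1.0] * 8)
```

The sketch only shows the ordering of the updates; it makes no claim about the layer types, channel counts, or number of recurrent iterations used in the paper.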
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions, where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture).
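The mixture constraint stated in the abstract (estimated speaker images at a microphone should add up to that microphone's mixture) can be written as a simple consistency measure. This is a minimal sketch under stated assumptions: the function name is hypothetical, signals are plain lists of time-domain samples, and the mean absolute error is an illustrative choice, not the loss used in the paper.

```python
def mixture_consistency_loss(mixtures, estimates):
    """Mean absolute gap between each mixture and the sum of estimated speaker images.

    mixtures[p][t]     : mixture sample at microphone p, time t
    estimates[c][p][t] : estimated image of speaker c at microphone p, time t
    """
    total = 0.0
    count = 0
    for p, mixture in enumerate(mixtures):
        for t, observed in enumerate(mixture):
            reconstructed = sum(est[p][t] for est in estimates)
            total += abs(observed - reconstructed)
            count += 1
    return total / count


# Toy over-determined case: 2 speakers, 3 microphones, 2 samples each.
images = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # speaker 0 at mics 0..2
    [[1.0, -1.0], [0.5, 0.5], [0.0, 0.2]],  # speaker 1 at mics 0..2
]
mixtures = [
    [a + b for a, b in zip(images[0][p], images[1][p])] for p in range(3)
]

loss_exact = mixture_consistency_loss(mixtures, images)

# Zeroing out one speaker breaks the constraint and raises the loss.
silent = [[0.0, 0.0] for _ in range(3)]
loss_wrong = mixture_consistency_loss(mixtures, [images[0], silent])
```

With more microphones than speakers, each additional mixture adds one more such constraint on the same set of sources, which is what narrows down the admissible solutions.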
Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network
Recent advances in the design of neural network architectures, in particular those specialized in modeling sequences, have provided significant improvements in speech separation performance. In this work, we propose to use a bio-inspired architecture called Fully Recurrent Convolutional Neural Network (FRCNN) to solve the separation task. This model contains bottom-up, top-down and lateral connections to fuse information processed at various time-scales represented by stages. In contrast to the traditional approach of updating all stages in parallel, we propose to first update the stages one by one in the bottom-up direction, then fuse information from adjacent stages simultaneously, and finally fuse information from all stages into the bottom stage. Experiments showed that this asynchronous updating scheme achieved significantly better results with far fewer parameters than the traditional synchronous updating scheme on speech separation. In addition, the proposed model achieved competitive or better results with high efficiency compared with other state-of-the-art approaches on two benchmark datasets.
Speech Separation for Hearing-Impaired Children in the Classroom
Feyisayo Olalere, Kiki van der Heijden, H. Christiaan Stronks, Jeroen Briaire, Johan H. M. Frijns, Yagmur Güçlütürk
Fig. 1. The process includes simulating room and listener acoustic properties (A), modeling talkers' movement trajectories (B), and synthesizing classroom speech mixtures (C). The numbers (1)-(5) correspond to the steps itemized in Section II-B.

... more challenging and reflective of classroom acoustics. The separation model is trained to output time-domain waveforms for each speaker with no interference from the other speaker or background noise. This setup enables the model not only to separate overlapping speech, but also to preserve spatial distinctions associated with each moving source.

B. Simulation of Overlapping Speech for Classroom Conditions

To capture the reverberant and spatial characteristics typical of classroom environments, we developed a spatialization pipeline for generating training and evaluation data (see Fig. 1). This pipeline consists of five main components, which are explained below in detail:

1) Simulation of room impulse responses (RIRs)
2) Application of head-related impulse responses (HRIRs)
3) Generation of binaural room impulse responses (BRIRs)
4) Modeling of talkers' movement trajectories
5) Synthesis of the classroom speech data

1) Room Impulse Responses: To simulate naturalistic reverberant classroom acoustics, we generated RIRs that capture direct sound, early reflections, and reverberation or echo. These RIRs were used to spatialize source signals in simulated classroom environments with varying geometry, reverberation, and source-listener distances. We used the Pyroomacoustics Python package [35], which implements the image source method to model sound propagation in rectangular (shoebox) rooms. A total of 30 classrooms were simulated, with dimensions randomly sampled from a range of 8.5 × 8.5 × 3 m to 10 × 10 × 3.5 m (length × width × height), reflecting typical U.S. classroom sizes [36], [37].
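The room-sampling step can be sketched as follows. This is a minimal illustration, assuming each dimension is drawn independently and uniformly within the stated range; the excerpt does not specify the sampling distribution, and the constant names, function name, and seed are hypothetical.

```python
import random

# Dimension bounds in metres (length, width, height), taken from the text.
MIN_DIMS = (8.5, 8.5, 3.0)
MAX_DIMS = (10.0, 10.0, 3.5)
NUM_ROOMS = 30


def sample_classrooms(num_rooms=NUM_ROOMS, seed=0):
    """Draw (length, width, height) triples for shoebox rooms.

    Assumption: independent uniform sampling per dimension.
    """
    rng = random.Random(seed)  # fixed seed so the set of rooms is reproducible
    return [
        tuple(rng.uniform(low, high) for low, high in zip(MIN_DIMS, MAX_DIMS))
        for _ in range(num_rooms)
    ]


rooms = sample_classrooms()
```

Each sampled triple could then serve as the dimensions of a shoebox room in an image-source simulator, for example a `pyroomacoustics.ShoeBox`, to generate the RIRs for that classroom.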